PyDrill Demonstration

This notebook demonstrates how to use the PyDrill module to connect to Apache Drill and query data. The complete documentation for PyDrill can be found at http://pydrill.readthedocs.io

The essential steps are:

  1. Import the module
  2. Open a connection to Drill
  3. Execute a query
  4. Do something with the results.

You will first need to install PyDrill. This can be done by opening a terminal and typing:

pip install pydrill

Step 1: Import the PyDrill module

After you've done this, you will be able to import the PyDrill module.


In [1]:
from pydrill.client import PyDrill

Step 2: Open a connection to Drill

The next step is to open a connection to Drill. Once you've opened the connection, you will want to verify that the connection was successfully opened before executing any queries. PyDrill includes an is_active() method for this purpose.


In [2]:
#Open a connection to Drill
drill = PyDrill(host='localhost', port=8047)

#Verify the connection is active, throw an error if not.
if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

Step 3: Execute a query and get the results

The next and final step is to execute a query in Drill. When you call the .query() method, PyDrill returns an iterable object from which you can extract the rows of your results. You can also get PyDrill to return a pandas DataFrame.


In [3]:
#Execute query in Drill
query_result = drill.query('''
 SELECT JobTitle, 
AVG( CAST( LTRIM( AnnualSalary, '$' ) AS FLOAT) ) AS avg_salary, 
COUNT( DISTINCT name ) AS number
FROM dfs.drillworkshop.`*.csvh`
GROUP BY JobTitle
Order By avg_salary DESC
LIMIT 10
''')

#Iterate through the rows.
for row in query_result:
    print( row )


{'avg_salary': '238772.0', 'number': '1', 'JobTitle': "STATE'S ATTORNEY"}
{'avg_salary': '211785.0', 'number': '1', 'JobTitle': 'Police Commissioner'}
{'avg_salary': '178900.0', 'number': '1', 'JobTitle': 'Executive Director V'}
{'avg_salary': '167449.0', 'number': '1', 'JobTitle': 'MAYOR'}
{'avg_salary': '166500.0', 'number': '1', 'JobTitle': 'CITY SOLICITOR'}
{'avg_salary': '166500.0', 'number': '1', 'JobTitle': 'DIRECTOR PUBLIC WORKS'}
{'avg_salary': '166401.66666666666', 'number': '9', 'JobTitle': 'Executive Director III'}
{'avg_salary': '159800.0', 'number': '1', 'JobTitle': 'CITY AUDITOR'}
{'avg_salary': '146891.5', 'number': '4', 'JobTitle': 'Executive Director IV'}
{'avg_salary': '146500.0', 'number': '3', 'JobTitle': 'Assistant Fire Chief'}

Retrieving a DataFrame

You can also get PyDrill to directly return a DataFrame by using the .to_dataframe() method of the results object.


In [5]:
df = query_result.to_dataframe()
df.head()


Out[5]:
JobTitle avg_salary number
0 STATE'S ATTORNEY 238772.0 1
1 Police Commissioner 211785.0 1
2 Executive Director V 178900.0 1
3 MAYOR 167449.0 1
4 CITY SOLICITOR 166500.0 1

In Class Exercise:

Using the data in the dailybots.csv file use Drill to:

  1. Query the file to produce a summary of infections by day.
  2. Store this data in a dataframe using the to_dataframe() method.
  3. Create a line plot of this data by calling the .plot() method on the dataframe

In [ ]: